Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has been automated. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. These systems attract great interest today due to their important role in traffic, environmental, and health issues.
The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behavior. The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system in Washington, D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data and can also be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. The data were aggregated on a daily basis, with the corresponding weather and seasonal information added.
The dataset consists of the following variables:
- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
We will use this dataset to predict the count of total rental bikes (the cnt variable) based on the features described above.
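Since the weather variables are stored in normalized form, their original units can be recovered by reversing the scaling factors listed above. A minimal sketch (the `denormalize` helper is ours, not part of the dataset):

```r
# Undo the normalization described in the variable list above.
# Scaling factors (41, 50, 100, 67) come from the dataset documentation.
denormalize <- function(temp, atemp, hum, windspeed) {
  list(temp_c  = temp * 41,       # temperature in Celsius
       atemp_c = atemp * 50,      # feeling temperature in Celsius
       hum_pct = hum * 100,       # relative humidity in percent
       wind    = windspeed * 67)  # wind speed
}
# e.g. the first row of day.csv has temp = 0.344:
denormalize(0.344, 0.364, 0.806, 0.16)$temp_c # about 14.1 degrees Celsius
```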
The purpose of this work is to understand how the variables in the bike-sharing dataset combine in complex relationships to predict bike demand, and how attributes like weather and time affect the quantitative outcome of bike rentals. Quantifying these relationships will assist in forecasting short-term demand for bike rentals.
The data consists of CSV files containing daily and hourly summaries of bike rentals with the corresponding weather. We decided to focus on the daily data.
data = read.csv("day.csv")
str(data)
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
In the imported data the only categorical feature is the date, which we will remove since it is unique to each row. Some of the features should be categorical, so we will convert them to factors and assign meaningful levels.
data$season = as.factor(data$season)
levels(data$season) <- c("spring", "summer", "fall", "winter") # assign meaningful level names
data$holiday = as.factor(data$holiday)
levels(data$holiday) = c("no", "yes")
data$workingday = as.factor(data$workingday)
levels(data$workingday) = c("no", "yes")
data$weathersit = as.factor(data$weathersit)
levels(data$weathersit) = c("Clearish", "Misty", "LightPrecip") # only three of the four documented levels occur
data$weekday = as.factor(data$weekday)
levels(data$weekday) = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
data$mnth = as.factor(data$mnth)
levels(data$mnth) = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
str(data) # look at the structure of transformed data
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday : Factor w/ 7 levels "Sun","Mon","Tue",..: 7 1 2 3 4 5 6 7 1 2 ...
## $ workingday: Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 1 2 ...
## $ weathersit: Factor w/ 3 levels "Clearish","Misty",..: 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
We should note that although four weathersit levels are specified in the documentation, only three of those levels actually appear in the daily data.
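This can be confirmed by tabulating the raw codes before the factor conversion; a quick sketch, assuming day.csv is in the working directory:

```r
# Tabulate the raw weathersit codes; code 4 (heavy rain / snow + fog)
# never occurs in the daily data, so only three factor levels are needed.
raw <- read.csv("day.csv")
table(raw$weathersit)
```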
sum(is.na(data))
## [1] 0
There are no fields missing in the data.
Before beginning the process of modelling the data we performed exploratory data analysis to gain insight into the distribution of features and their relationships. As this analysis was not directly related to the choice of a model it is included in the Appendix.
The exploratory analysis looked at the distributions of individual features, relationships between features and the response, correlation between features, and tried to find interactions between features as related to the response.
First we remove features that will not be used for our modeling. instant and dteday are unique for each row, and we also remove casual and registered since we are modeling total ridership.
data = data[, c(-1, -2, -14, -15)] # drop instant, dteday, casual and registered, which are not needed for our model
We also create a function to calculate LOOCV-RMSE as we will be using this to evaluate our models.
# function to calculate leave-one-out cross-validated RMSE
calc_loocv_rmse = function(model) {
  # shortcut: LOOCV residuals computed from the full fit via hat values
  temp = (resid(model) / (1 - hatvalues(model)))^2
  temp = temp[is.finite(temp)] # drop observations with leverage of exactly 1
  sqrt(mean(temp))
}
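The shortcut used in this function is the standard leave-one-out identity for linear models: the deleted residuals can be obtained from a single full-data fit, so no refitting is required:

\[
\text{LOOCV-RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2 },
\]

where \(e_i\) is the ordinary residual of observation \(i\) and \(h_{ii}\) is its leverage (hat value). Observations with \(h_{ii} = 1\) are dropped by the is.finite check.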
Now we will separate numeric and categorical features.
numerical <- unlist(lapply(data, is.numeric)) # logical vector indicating which columns are numeric
We begin by creating a model using only the numerical predictors to model the target. Diagnostics for this model are in Figure 1.
data_numerical= data[, numerical] # get the target and all the numerical columns
bike_mod_num = lm(cnt ~ ., data = data_numerical) # model with all numerical variables
summary(bike_mod_num)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.7261027
calc_loocv_rmse(bike_mod_num) # get the loocv rmse
## [1] 1041.521
This model has a rather low adjusted \(R^2\) and a quite high cross-validated RMSE, which indicates that the model is not explaining the response as well as it could.
We now examine the p-values for the coefficients of the model:
summary(bike_mod_num)$coefficients[, 'Pr(>|t|)']
## (Intercept) yr temp atemp hum
## 7.548072e-19 6.262707e-109 2.937966e-01 4.505973e-03 1.193052e-15
## windspeed
## 7.785849e-15
The above results show that temp is not a very significant predictor, as it has a high p-value. This can be attributed to the fact that temp and atemp are highly correlated, as we saw in the EDA; the effect of one is absorbed by the other.
par(mfrow = c(2, 2))
plot(bike_mod_num,col = 'dodgerblue') # do diagnostics
The Residuals vs Fitted plot shows a slight non-linear trend, and the leverage plot highlights some outliers.
bike_mod_all=lm(cnt~., data=data) # model with all the variables
summary(bike_mod_all)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423198
calc_loocv_rmse(bike_mod_all) # get the loocv rmse
## [1] 794.7767
Including the categorical features has substantially improved the adjusted \(R^2\) and lowered the LOOCV-RMSE.
P-values for coefficients for the model:
summary(bike_mod_all)$coefficients[, 'Pr(>|t|)'] # get the coefficient p-values
## (Intercept) seasonsummer seasonfall
## 9.774832e-10 1.032402e-06 1.024766e-04
## seasonwinter yr mnthFeb
## 2.279871e-17 2.294966e-154 3.624432e-01
## mnthMar mnthApr mnthMay
## 1.084575e-03 6.881974e-02 6.145473e-03
## mnthJun mnthJul mnthAug
## 6.842301e-02 9.218536e-01 1.426395e-01
## mnthSep mnthOct mnthNov
## 1.651562e-04 3.178651e-02 6.132572e-01
## mnthDec holidayyes weekdayMon
## 6.230982e-01 1.129889e-03 5.318673e-02
## weekdayTue weekdayWed weekdayThu
## 3.981733e-03 4.136266e-04 3.498183e-04
## weekdayFri weekdaySat weathersitMisty
## 5.296026e-05 4.007227e-05 3.156422e-09
## weathersitLightPrecip temp atemp
## 5.386219e-22 4.152649e-02 2.222607e-01
## hum windspeed
## 2.013660e-07 2.091887e-11
The p-values indicate that not all of the variables are significant.
Diagnostic plots for the model:
par(mfrow = c(2, 2))
plot(bike_mod_all,col = 'dodgerblue') # do diagnostics
We still see some issues in the diagnostic plots.
Based upon the results, we want to examine the significance of several of the categorical variables:
bike_mod_w_month=lm(cnt~.-mnth, data=data) # model without month
bike_mod_w_weekday=lm(cnt~.-weekday, data=data) # model without weekday
bike_mod_w_workingday=lm(cnt~.-workingday, data=data) # model without workingday
anova(bike_mod_w_month,bike_mod_w_weekday,bike_mod_w_workingday,bike_mod_all) # do anova test
## Analysis of Variance Table
##
## Model 1: cnt ~ (season + yr + mnth + holiday + weekday + workingday +
## weathersit + temp + atemp + hum + windspeed) - mnth
## Model 2: cnt ~ (season + yr + mnth + holiday + weekday + workingday +
## weathersit + temp + atemp + hum + windspeed) - weekday
## Model 3: cnt ~ (season + yr + mnth + holiday + weekday + workingday +
## weathersit + temp + atemp + hum + windspeed) - workingday
## Model 4: cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit +
## temp + atemp + hum + windspeed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 713 472452877
## 2 707 428442679 6 44010198 12.3957 2.683e-13 ***
## 3 702 415401688 5 13040991 4.4077 0.0005852 ***
## 4 702 415401688 0 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test shows that month and weekday are significant predictors, so we cannot rule them out. Even though the month variable is statistically significant, it may be that only a few of its levels are useful while the rest do not help; we will use model selection schemes later to investigate that.
According to the test results, workingday is not significant: dropping it leaves the RSS unchanged, which makes sense because workingday is fully determined by weekday and holiday. We can rule out this variable.
We will now re-fit the model excluding working day:
data_2= data[,c(-6)] # remove working day variable
bike_mod_all_2=lm(cnt~., data=data_2) # model with all remaining variable
summary(bike_mod_all_2)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423198
calc_loocv_rmse(bike_mod_all_2) # get the loocv rmse
## [1] 794.7767
Excluding workingday had no effect on the adjusted \(R^2\) or LOOCV-RMSE of the model, so we can safely remove that feature.
Now we will use the variance inflation factor (VIF) to determine whether we have multicollinearity issues:
vif(bike_mod_all_2) # vif() is provided by a package such as faraway, not base R
## seasonsummer seasonfall seasonwinter
## 7.496334 10.720030 7.455172
## yr mnthFeb mnthMar
## 1.046828 1.835947 2.624298
## mnthApr mnthMay mnthJun
## 5.704404 6.868018 7.423137
## mnthJul mnthAug mnthSep
## 9.443506 8.813082 6.542133
## mnthOct mnthNov mnthDec
## 5.594951 4.956867 3.183722
## holidayyes weekdayMon weekdayTue
## 1.121296 1.821782 1.730242
## weekdayWed weekdayThu weekdayFri
## 1.741363 1.743011 1.740119
## weekdaySat weathersitMisty weathersitLightPrecip
## 1.725530 1.642310 1.338411
## temp atemp hum
## 80.806689 70.036722 2.140362
## windspeed
## 1.273296
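For reference, the VIF for predictor \(j\) measures how much the variance of its coefficient estimate is inflated by collinearity with the other predictors:

\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\]

where \(R_j^2\) is the \(R^2\) obtained by regressing predictor \(j\) on all remaining predictors. Values above roughly 5 to 10 are commonly taken to indicate problematic collinearity, a threshold temp and atemp far exceed here.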
As we had suspected, temp and atemp have a high level of collinearity. We will now look at the partial correlation coefficient between temp and cnt:
temp_model_1 <- lm(temp ~ . - cnt, data = data_2)
temp_model_2 <- lm(cnt ~ . - temp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.0768418
While this is relatively small, since temp and atemp are highly correlated we should check the partial correlation coefficient again after removing atemp:
temp_model_1 <- lm(temp ~ . - cnt - atemp, data = data_2)
temp_model_2 <- lm(cnt ~ . - temp - atemp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3801001
Now we will check the partial correlation coefficient between atemp and cnt, removing temp:
temp_model_1 <- lm(atemp ~ . - cnt - temp, data = data_2)
temp_model_2 <- lm(cnt ~ . - atemp - temp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3757927
These results indicate that temp is more correlated with cnt than atemp is, so we will remove atemp and leave temp:
data_3= data_2[,-8]
bike_mod_all_3=lm(cnt~., data=data_3)
summary(bike_mod_all_3)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8422094
calc_loocv_rmse(bike_mod_all_3) # get the loocv rmse
## [1] 789.3204
This change slightly lowered the adjusted \(R^2\), as we would expect when removing a predictor, but improved the LOOCV-RMSE, which indicates that it improved the model.
As we have concerns about the year predictor, we will also look at the partial correlation coefficient between year and count:
yr_mod_0 <- lm(cnt ~ . - yr, data = data_3)
yr_mod_1 <- lm(yr ~ . - cnt, data = data_3)
cor(resid(yr_mod_0), resid(yr_mod_1))
## [1] 0.7942529
This is quite high which indicates that the year has a significant relationship with ridership. We have seen that the ridership seems to be increasing from year to year (with some seasonal cycles within that trend.) Since our data only includes two years this may cause problems with using the model to extrapolate to years that are not included in the training set. However, the high partial correlation coefficient indicates that year is an important predictor so we will leave it in the model.
Next we want to check for potential outliers and influential observations.
We will use Cook's distance to identify them and see their effect on the model.
First we will count the observations flagged by Cook's distance:
sum(cooks.distance(bike_mod_all_3) > 4 / length(cooks.distance(bike_mod_all_3)))
## [1] 44
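For reference, Cook's distance combines the residual and the leverage of each observation to measure its influence on the fit:

\[
D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},
\]

where \(p\) is the number of parameters and \(s^2\) the residual mean square. The \(4/n\) cutoff used above is a common heuristic for flagging influential points, not a formal test.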
Now we will refit a model excluding the identified observations:
cokks_distance = cooks.distance(bike_mod_all_3)
bike_mod_all_4 = lm(cnt ~.,
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
summary(bike_mod_all_4)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9085799
calc_loocv_rmse(bike_mod_all_4) # get the loocv rmse
## [1] 586.5534
Removing these outliers resulted in a substantial increase in \(R^2\) and a substantial decrease in LOOCV-RMSE.
par(mfrow = c(2, 2))
plot(bike_mod_all_4,col = 'dodgerblue') # do diagnostics
In the Residuals vs Fitted plot the residuals now look more evenly scattered, although we still see some non-linear patterns, which indicates that we may want to try higher-order terms or transformations.
The Residuals vs Leverage plot also looks much neater and the Normal Q-Q plot has also improved.
We will now evaluate including interactions:
bike_mod_all_5 = lm(cnt ~.^2,
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
summary(bike_mod_all_5)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9466695
Including all possible interactions resulted in a substantial improvement in adjusted \(R^2\); however, we cannot be sure whether this improvement comes from a better model or merely from the inclusion of additional predictors.
calc_loocv_rmse(bike_mod_all_5) # get the loocv rmse
## [1] 759.5543
The LOOCV-RMSE has increased, which indicates that the additional terms may have resulted in overfitting the data.
par(mfrow = c(2, 2))
plot(bike_mod_all_5,col = 'dodgerblue') # do diagnostics
length(coef(bike_mod_all_5)) # get the number of params
## [1] 305
The Residuals vs Fitted plot looks random, with errors distributed around the zero line, though some outliers are still highlighted on the plot.
The Q-Q plot looks more normal.
The leverage plot shows some indication of potential new outliers.
Since the model including all possible interactions increased the LOOCV-RMSE, indicating that it was overfitting the training data, we will perform a backward AIC step search to remove the non-significant terms.
bike_mod_all_6=step(bike_mod_all_5, trace=0, direction="backward")
length(coef(bike_mod_all_6)) # get the number of params
## [1] 117
summary(bike_mod_all_6)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9481845
While decreasing the number of predictors from 305 to 117, the backwards step search also resulted in a very small improvement in adjusted \(R^2\).
calc_loocv_rmse(bike_mod_all_6) # get the loocv rmse
## [1] 470.9009
The LOOCV-RMSE also improved significantly, which indicates that this smaller model is a much better model for prediction.
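Recall that for a linear model the criterion minimized at each backward step is, up to an additive constant,

\[
\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p,
\]

so a term is removed only when the resulting increase in RSS is outweighed by the \(2p\) penalty on model size, which is why the search favors this much smaller model.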
par(mfrow = c(2, 2))
plot(bike_mod_all_6,col = 'dodgerblue') # do diagnostics
The Residuals vs Fitted plot still looks well behaved.
We have previously seen some issues which indicated that polynomial features might improve the model. We will now evaluate including them:
temp_mod = lm(cnt ~ .+I(temp^2)+I(hum^2)+I(windspeed^2),
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_7=step(temp_mod, trace=0, direction="backward")
summary(bike_mod_all_7)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9203027
The adjusted \(R^2\) is lower than it was for our interaction model.
calc_loocv_rmse(bike_mod_all_7) # get the loocv rmse
## [1] 548.4574
The LOOCV-RMSE has also increased.
par(mfrow = c(2, 2))
plot(bike_mod_all_7,col = 'dodgerblue') # do diagnostics
The Residuals vs Fitted plot does not look random and shows some non-linear pattern, which indicates that the inclusion of these terms has not improved the model. However, \(I(temp^2)\) has a very low p-value, which indicates that it may be useful; we will keep this in mind.
We will now evaluate taking a log transformation of the response.
temp_m = lm(log(cnt) ~.^2,
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_8=step(temp_m, trace=0, direction="backward")
summary(bike_mod_all_8)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9378731
The adjusted \(R^2\) has been lowered by this transformation.
par(mfrow = c(2, 2))
plot(bike_mod_all_8,col = 'dodgerblue') # do diagnostics
In addition we see issues in both the Residuals vs Fitted plot and the Normal Q-Q plot. We can conclude that this transformation was not helpful.
Since the target is a count variable, we will evaluate a Poisson regression.
poisson_mod = glm(cnt ~ (. ^2),
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance), family=poisson)
bike_mod_all_9 = step(poisson_mod, trace=0, direction="backward")
Since a Poisson GLM applies a log link internally, we will need to do a bit of work to estimate a comparable LOOCV-RMSE:
filter = cokks_distance <= 4 / length(cokks_distance)
e <- data_3$cnt[filter] - bike_mod_all_9$fitted.values
temp=(e / (1 - hatvalues(bike_mod_all_9)))^2
temp=temp[is.finite(temp)]
sqrt(mean(temp))
## [1] 629.2086
While we are not confident that this LOOCV-RMSE value is exact given the internal transformations performed by Poisson GLMs, it is in line with the values produced by earlier models and significantly higher than those of our best models.
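The reason the comparison is only approximate: because of the log link, the fitted values of a Poisson GLM are already on the response scale,

\[
\hat{\mu}_i = \exp\!\left(x_i^\top \hat{\beta}\right),
\]

so the residuals \(\text{cnt}_i - \hat{\mu}_i\) used above are comparable to those of the linear models, but the hat values come from the iteratively reweighted least-squares fit, so the leave-one-out shortcut does not hold exactly.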
par(mfrow = c(2, 2))
plot(bike_mod_all_9,col = 'dodgerblue') # do diagnostics
While we see some issues with the lower tail of the Normal Q-Q plot the Residuals vs Fitted plot indicates no issues.
Due to the lack of available metrics to compare the Poisson regressor with our other models, we will continue with the non-GLM models.
Finally, since some of the polynomial terms seemed as if they could be significant, we will try a model which includes interactions and polynomial terms:
bike_mod_int_poly = lm(cnt ~(. ^ 2) + I(temp^2) + I(hum^2) + I(windspeed^2),
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_10 = step(bike_mod_int_poly, trace=0, direction="backward")
summary(bike_mod_all_10)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9497541
The adjusted \(R^2\) is the best we have found yet.
calc_loocv_rmse(bike_mod_all_10) # get the loocv rmse
## [1] 461.4496
And this model also results in the lowest LOOCV-RMSE.
Finally we will refit the model using the full data set, find the observations with high leverage and refit the model excluding those items:
bike_mod_int_poly_full = lm(cnt ~(. ^ 2) + I(temp^2) + I(hum^2) + I(windspeed^2),
data = data_3)
bike_mod_all_11 = step(bike_mod_int_poly_full, trace=0, direction="backward")
cooks_distance = cooks.distance(bike_mod_all_11) # find influential observations and exclude them
filter = cooks_distance <= (4 / length(cooks_distance))
# some points have a cooks distance of NA, we will consider these to be outliers
filter[is.na(filter)] <- FALSE
bike_mod_int_poly_full = lm(cnt ~(. ^ 2) + I(temp^2) + I(hum^2) + I(windspeed^2),
data = data_3,
subset = filter)
bike_mod_all_11 = step(bike_mod_int_poly_full, trace=0, direction="backward")
summary(bike_mod_all_11)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9581752
calc_loocv_rmse(bike_mod_all_11) # get the loocv rmse
## [1] 426.8036
Finding and removing the observations with high leverage has improved the LOOCV-RMSE and the \(R^2\).
Our best model included both interactions and polynomial terms.
summary(bike_mod_all_11)$coef
## Estimate Std. Error t value
## (Intercept) -1683.157857 628.16807 -2.67947056
## seasonsummer 1007.282902 514.22911 1.95882123
## seasonfall 5346.488321 1355.39766 3.94459021
## seasonwinter 1875.157098 977.29943 1.91871297
## yr 1571.808081 290.91020 5.40306968
## mnthFeb -36.596132 426.16630 -0.08587289
## mnthMar -874.913934 487.62426 -1.79423792
## mnthApr -1649.057717 850.64501 -1.93859683
## mnthMay 1594.909994 1098.35947 1.45208380
## mnthJun 5073.201019 1479.97327 3.42790044
## mnthJul 7675.642562 2280.34604 3.36599904
## mnthAug 576.367356 2076.57684 0.27755648
## mnthSep -233.984044 1569.61641 -0.14907084
## mnthOct -248.871111 1256.54949 -0.19805914
## mnthNov -406.179652 1189.33609 -0.34151797
## mnthDec -947.636192 981.89091 -0.96511352
## holidayyes -4186.149391 1605.72061 -2.60702228
## weekdayMon 205.959122 138.53667 1.48667586
## weekdayTue 349.335237 136.33276 2.56237199
## weekdayWed 404.042478 140.25066 2.88085977
## weekdayThu 236.222443 139.94812 1.68792861
## weekdayFri 430.350153 141.74698 3.03604465
## weekdaySat -16.410314 130.45377 -0.12579409
## weathersitMisty 551.344014 350.45802 1.57320989
## weathersitLightPrecip -14720.916520 155712.09794 -0.09453932
## temp 7183.427951 1596.36499 4.49986562
## hum 3899.353693 1665.27408 2.34156872
## windspeed 8690.056734 1838.65179 4.72631999
## I(temp^2) -5972.663959 2799.15244 -2.13374015
## I(hum^2) -2106.785083 1306.74444 -1.61223956
## I(windspeed^2) -9860.844181 2280.57188 -4.32384714
## seasonsummer:yr 1285.262946 264.17642 4.86516910
## seasonfall:yr 533.058497 326.77930 1.63124928
## seasonwinter:yr 659.955073 313.89887 2.10244491
## seasonsummer:holidayyes -872.834260 771.16033 -1.13184538
## seasonfall:holidayyes -1920.061232 1083.21910 -1.77255112
## seasonsummer:weekdayMon -719.252826 174.75416 -4.11579812
## seasonfall:weekdayMon -355.732004 172.02576 -2.06789963
## seasonwinter:weekdayMon -242.071372 174.39260 -1.38808282
## seasonsummer:weekdayTue -647.481007 170.66887 -3.79378494
## seasonfall:weekdayTue -160.705969 165.72005 -0.96974365
## seasonwinter:weekdayTue 9.148958 172.05279 0.05317529
## seasonsummer:weekdayWed -710.700517 165.60444 -4.29155482
## seasonfall:weekdayWed -302.930356 169.48059 -1.78740442
## seasonwinter:weekdayWed -209.656446 173.69328 -1.20704985
## seasonsummer:weekdayThu -476.993276 167.14545 -2.85376173
## seasonfall:weekdayThu -252.243035 166.94572 -1.51092845
## seasonwinter:weekdayThu -107.387186 177.16770 -0.60613299
## seasonsummer:weekdayFri -408.566665 170.47384 -2.39665308
## seasonfall:weekdayFri -308.933488 165.82517 -1.86300718
## seasonwinter:weekdayFri -229.443635 170.85439 -1.34291917
## seasonsummer:weekdaySat 231.030011 165.28014 1.39780862
## seasonfall:weekdaySat 263.169325 163.59916 1.60862268
## seasonwinter:weekdaySat 371.076020 164.70581 2.25296254
## seasonsummer:weathersitMisty -722.271467 329.65371 -2.19100058
## seasonfall:weathersitMisty -563.891641 386.74521 -1.45804426
## seasonwinter:weathersitMisty -627.249044 330.50239 -1.89786538
## seasonfall:weathersitLightPrecip -1223.564090 28423.50219 -0.04304762
## seasonsummer:temp -4982.990491 1674.57193 -2.97568017
## seasonfall:temp -10478.551306 2520.37862 -4.15753063
## seasonwinter:temp -4927.242674 1542.84491 -3.19360853
## seasonsummer:hum 2510.784980 1053.21752 2.38391873
## seasonfall:hum 1984.759437 1633.72187 1.21486985
## seasonwinter:hum 2244.421218 1554.85212 1.44349497
## yr:mnthFeb -23.032973 176.95678 -0.13016157
## yr:mnthMar 161.433894 214.44466 0.75279980
## yr:mnthApr -933.972945 348.39643 -2.68077645
## yr:mnthMay -1730.266988 378.14572 -4.57566195
## yr:mnthJun -1460.191848 392.56008 -3.71966464
## yr:mnthJul -906.094655 447.75498 -2.02363950
## yr:mnthAug -394.949341 430.67504 -0.91704721
## yr:mnthSep -336.952954 403.58612 -0.83489729
## yr:mnthOct -25.875650 382.07223 -0.06772450
## yr:mnthNov -519.130822 374.19104 -1.38734166
## yr:mnthDec -600.743332 317.58216 -1.89161550
## yr:weekdayMon 447.472396 119.09175 3.75737534
## yr:weekdayTue 357.617298 119.33159 2.99683673
## yr:weekdayWed 427.096412 121.77112 3.50737041
## yr:weekdayThu 655.717206 121.01662 5.41840634
## yr:weekdayFri 525.159691 118.57263 4.42901264
## yr:weekdaySat 583.689956 117.20206 4.98020202
## yr:weathersitMisty -289.958296 94.11826 -3.08078680
## yr:temp 1712.948932 558.46150 3.06726412
## yr:hum -968.212312 371.60507 -2.60548734
## yr:windspeed -942.668377 487.55091 -1.93347683
## mnthFeb:weathersitMisty -113.544874 210.17913 -0.54022907
## mnthMar:weathersitMisty -165.056877 263.32604 -0.62681562
## mnthApr:weathersitMisty -344.205723 419.60135 -0.82031604
## mnthMay:weathersitMisty 405.748885 428.55351 0.94678698
## mnthJun:weathersitMisty -286.810348 446.77550 -0.64195631
## mnthJul:weathersitMisty -330.613027 503.78410 -0.65625935
## mnthAug:weathersitMisty -75.573267 490.25090 -0.15415222
## mnthSep:weathersitMisty -262.344593 444.44444 -0.59027534
## mnthOct:weathersitMisty -121.263680 406.18655 -0.29854184
## mnthNov:weathersitMisty 403.136037 387.55104 1.04021405
## mnthDec:weathersitMisty 544.194050 340.07966 1.60019580
## mnthOct:weathersitLightPrecip -2160.039406 29281.33694 -0.07376847
## mnthFeb:temp -537.529140 1124.60346 -0.47797215
## mnthMar:temp 3187.552028 1393.64365 2.28720738
## mnthApr:temp 8776.005504 2373.96634 3.69676913
## mnthMay:temp 7050.072966 2851.35326 2.47253578
## mnthJun:temp 1241.808461 3251.55191 0.38191254
## mnthJul:temp -2171.924595 4051.26205 -0.53611062
## mnthAug:temp 7227.527231 3819.78852 1.89212759
## mnthSep:temp 9624.696390 3315.19895 2.90320325
## mnthOct:temp 7343.714804 2337.15085 3.14216552
## mnthNov:temp 4487.196466 2367.00697 1.89572592
## mnthDec:temp 5835.861100 1806.21144 3.23099553
## mnthFeb:hum 646.600316 723.38478 0.89385392
## mnthMar:hum 382.660731 828.30867 0.46197842
## mnthApr:hum -1073.359121 1289.66474 -0.83227764
## mnthMay:hum -4150.273136 1335.75544 -3.10706063
## mnthJun:hum -3742.574846 1339.23320 -2.79456545
## mnthJul:hum -3628.858102 1832.90232 -1.97984261
## mnthAug:hum -3939.536157 1854.49777 -2.12431432
## mnthSep:hum -4590.272225 1874.44235 -2.44887351
## mnthOct:hum -2888.400625 1841.22892 -1.56873520
## mnthNov:hum -1784.941857 1767.63854 -1.00978895
## mnthDec:hum -1577.333641 1554.63469 -1.01460083
## holidayyes:hum 6725.198944 3288.03915 2.04535245
## weekdayMon:weathersitMisty 200.222516 135.29996 1.47984162
## weekdayTue:weathersitMisty -69.502245 135.04967 -0.51464211
## weekdayWed:weathersitMisty -47.747137 135.91327 -0.35130593
## weekdayThu:weathersitMisty 29.179093 134.83998 0.21639794
## weekdayFri:weathersitMisty -159.463510 136.16194 -1.17113126
## weekdaySat:weathersitMisty -162.241594 137.11107 -1.18328590
## weekdayMon:weathersitLightPrecip 1433.004460 3176.29814 0.45115553
## weathersitMisty:temp 1859.348438 556.16869 3.34313756
## weathersitLightPrecip:temp 27925.347443 340979.06826 0.08189754
## weathersitMisty:hum -1743.947562 543.30563 -3.20988312
## hum:windspeed -9939.237300 1881.86830 -5.28157965
## Pr(>|t|)
## (Intercept) 7.604432e-03
## seasonsummer 5.066067e-02
## seasonfall 9.075035e-05
## seasonwinter 5.556053e-02
## yr 9.935915e-08
## mnthFeb 9.316001e-01
## mnthMar 7.334826e-02
## mnthApr 5.308403e-02
## mnthMay 1.470734e-01
## mnthJun 6.558348e-04
## mnthJul 8.182975e-04
## mnthAug 7.814617e-01
## mnthSep 8.815548e-01
## mnthOct 8.430752e-01
## mnthNov 7.328498e-01
## mnthDec 3.349306e-01
## holidayyes 9.392048e-03
## weekdayMon 1.376985e-01
## weekdayTue 1.067277e-02
## weekdayWed 4.127438e-03
## weekdayThu 9.201658e-02
## weekdayFri 2.515574e-03
## weekdaySat 8.999429e-01
## weathersitMisty 1.162704e-01
## weathersitLightPrecip 9.247167e-01
## temp 8.373268e-06
## hum 1.957449e-02
## windspeed 2.937482e-06
## I(temp^2) 3.332514e-02
## I(hum^2) 1.075085e-01
## I(windspeed^2) 1.834043e-05
## seasonsummer:yr 1.512655e-06
## seasonfall:yr 1.034351e-01
## seasonwinter:yr 3.598810e-02
## seasonsummer:holidayyes 2.582142e-01
## seasonfall:holidayyes 7.688075e-02
## seasonsummer:weekdayMon 4.477372e-05
## seasonfall:weekdayMon 3.913686e-02
## seasonwinter:weekdayMon 1.656983e-01
## seasonsummer:weekdayTue 1.655685e-04
## seasonfall:weekdayTue 3.326191e-01
## seasonwinter:weekdayTue 9.576124e-01
## seasonsummer:weekdayWed 2.111696e-05
## seasonfall:weekdayWed 7.444669e-02
## seasonwinter:weekdayWed 2.279544e-01
## seasonsummer:weekdayThu 4.490447e-03
## seasonfall:weekdayThu 1.314060e-01
## seasonwinter:weekdayThu 5.446874e-01
## seasonsummer:weekdayFri 1.689300e-02
## seasonfall:weekdayFri 6.301736e-02
## seasonwinter:weekdayFri 1.798762e-01
## seasonsummer:weekdaySat 1.627586e-01
## seasonfall:weekdaySat 1.082977e-01
## seasonwinter:weekdaySat 2.467196e-02
## seasonsummer:weathersitMisty 2.888888e-02
## seasonfall:weathersitMisty 1.454240e-01
## seasonwinter:weathersitMisty 5.826000e-02
## seasonfall:weathersitLightPrecip 9.656799e-01
## seasonsummer:temp 3.057572e-03
## seasonfall:temp 3.754637e-05
## seasonwinter:temp 1.489036e-03
## seasonsummer:hum 1.748238e-02
## seasonfall:hum 2.249598e-01
## seasonwinter:hum 1.494752e-01
## yr:mnthFeb 8.964883e-01
## yr:mnthMar 4.519064e-01
## yr:mnthApr 7.575227e-03
## yr:mnthMay 5.925594e-06
## yr:mnthJun 2.208904e-04
## yr:mnthJul 4.351140e-02
## yr:mnthAug 3.595374e-01
## yr:mnthSep 4.041539e-01
## yr:mnthOct 9.460306e-01
## yr:mnthNov 1.659240e-01
## yr:mnthDec 5.909025e-02
## yr:weekdayMon 1.908701e-04
## yr:weekdayTue 2.856506e-03
## yr:weekdayWed 4.912061e-04
## yr:weekdayThu 9.161609e-08
## yr:weekdayFri 1.151713e-05
## yr:weekdaySat 8.622295e-07
## yr:weathersitMisty 2.172360e-03
## yr:temp 2.271257e-03
## yr:hum 9.433685e-03
## yr:windspeed 5.371271e-02
## mnthFeb:weathersitMisty 5.892673e-01
## mnthMar:weathersitMisty 5.310515e-01
## mnthApr:weathersitMisty 4.124068e-01
## mnthMay:weathersitMisty 3.441812e-01
## mnthJun:weathersitMisty 5.211807e-01
## mnthJul:weathersitMisty 5.119437e-01
## mnthAug:weathersitMisty 8.775487e-01
## mnthSep:weathersitMisty 5.552591e-01
## mnthOct:weathersitMisty 7.654073e-01
## mnthNov:weathersitMisty 2.987175e-01
## mnthDec:weathersitMisty 1.101544e-01
## mnthOct:weathersitLightPrecip 9.412226e-01
## mnthFeb:temp 6.328682e-01
## mnthMar:temp 2.257921e-02
## mnthApr:temp 2.412296e-04
## mnthMay:temp 1.373035e-02
## mnthJun:temp 7.026802e-01
## mnthJul:temp 5.921083e-01
## mnthAug:temp 5.902186e-02
## mnthSep:temp 3.848437e-03
## mnthOct:temp 1.771322e-03
## mnthNov:temp 5.854311e-02
## mnthDec:temp 1.310630e-03
## mnthFeb:hum 3.718081e-01
## mnthMar:hum 6.442874e-01
## mnthApr:hum 4.056292e-01
## mnthMay:hum 1.991438e-03
## mnthJun:hum 5.386284e-03
## mnthJul:hum 4.824101e-02
## mnthAug:hum 3.410881e-02
## mnthSep:hum 1.465508e-02
## mnthOct:hum 1.173099e-01
## mnthNov:hum 3.130598e-01
## mnthDec:hum 3.107617e-01
## holidayyes:hum 4.131616e-02
## weekdayMon:weathersitMisty 1.395130e-01
## weekdayTue:weathersitMisty 6.070189e-01
## weekdayWed:weathersitMisty 7.254993e-01
## weekdayThu:weathersitMisty 8.287613e-01
## weekdayFri:weathersitMisty 2.420753e-01
## weekdaySat:weathersitMisty 2.372294e-01
## weekdayMon:weathersitLightPrecip 6.520630e-01
## weathersitMisty:temp 8.872392e-04
## weathersitLightPrecip:temp 9.347593e-01
## weathersitMisty:hum 1.408783e-03
## hum:windspeed 1.876616e-07
The model uses 151 predictors, which is quite a lot, and the p-values for many of them indicate that they may not be significant. However, most of the non-significant terms are individual levels of factor variables, which are difficult to remove in isolation.
plot(bike_mod_all_11$fitted.values, data_3$cnt[filter], main = "Fitted Values vs Actual", xlab = "Fitted Values", ylab = "Ground Truth", col = "darkblue", pch = 20)
abline(0, 1, col = "firebrick4", lwd = 3)
We also see that the fitted values for this model are quite close to the actual values.
summary(bike_mod_all_11)$adj.r.squared
## [1] 0.9581752
The adjusted \(R^2\) indicates that the model explains about 95.8% of the variance in the data.
The LOOCV-RMSE of the model is 426.8036309.
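The LOOCV-RMSE can be computed without refitting the model n times by using the standard leverage (hat value) shortcut for linear models; a helper of roughly this form (the name is our assumption, as the original chunk is not shown) produces the value above:

```r
# Assumed helper: LOOCV-RMSE for a linear model via the leverage shortcut,
# which rescales each residual by 1 minus its hat value instead of refitting
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
# calc_loocv_rmse(bike_mod_all_11)
```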
par(mfrow = c(2, 2))
plot(bike_mod_all_11,col = 'dodgerblue') # do diagnostics
The diagnostic plots all appear to be acceptable, although there are still some points with very high leverage.
sqrt(mean(bike_mod_all_11$residuals^2))
## [1] 343.8317
The model has a training RMSE of about 343.8, which means that on average its predictions are off by roughly that many riders.
# Count the number of high VIF's in the model
sum(vif(bike_mod_all_11) > 10)
## [1] 96
This was expected given the high number of categorical variables and their interactions. Since our main focus is demand prediction, we tuned the model for predictive accuracy at the expense of interpretability.
hist(resid(bike_mod_all_11), col = "lightslateblue", main = "Histogram of Residuals", xlab = "Residuals")
The histogram of residuals looks nearly normal, which suggests that the model's errors are approximately normally distributed and centered at 0.
So far we have tried multiple approaches to build our understanding of what works well for this data, and we appear to have reached a good spot: a model with both a high adjusted \(R^2\) and a low LOOCV-RMSE.
While the final model contained a large number of predictors, making it difficult to interpret, the results indicate that it is very good at explaining the relationship between weather, season, and bike sharing rentals. We tried using BIC to reduce the size of the final model; however, doing so resulted in a worse LOOCV-RMSE, so we preferred the AIC-selected model.
As expected, the weather situation is an especially important predictor, both by itself and in its interactions with other predictors, indicating that rain has a significant impact on the number of rentals, especially the interaction between light rain and windspeed.
We have some concerns about the inclusion of the year as a predictor as it may cause problems with using the model for inference on unseen data. While initially it was used as a categorical predictor, we changed it to a numeric predictor in the hopes that it would make the model generalize better to data from the future, should that ever be required. We attempted to model the data excluding year, however the general increasing ridership between the years proved to be an important factor. Future work could incorporate data from additional years to see if this trend continues.
The high adjusted \(R^2\) of the model indicates that a very large portion of the variance in the data can be explained by this model, which would make it very useful for predicting demand for bikes.
While we feel that a Poisson model would lend itself nicely to this data, the lack of available metrics to compare Poisson models with normal linear models made it prohibitively complicated to pursue, and the excellent performance of the non-Poisson models reduced the importance of pursuing Poisson models.
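For reference, a Poisson model for the daily counts could be fit with `glm`; the additive formula below is only an illustrative sketch of the approach, not our final AIC-selected specification:

```r
# Illustrative sketch only: a Poisson GLM for the daily rental counts.
# The additive formula is an assumption, not the selected model.
bike_mod_pois = glm(cnt ~ season + yr + mnth + holiday + weekday +
                      weathersit + temp + hum + windspeed,
                    data = data, family = poisson)
# Deviance-based diagnostics replace R^2 here, and the likelihood differs
# from the Gaussian models, which is what makes direct comparison awkward.
```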
This dataset proved to be multifarious and was very interesting to work with. However, due to time constraints, we were not able to do as complete an analysis of the data as we would have liked.
While our model achieved impressive results, based on the distribution of data between demand and type of day, as well as from the outlier detection through Cook’s method, it might be suitable to build separate models for Holidays, Workdays and Weekends.
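A minimal sketch of how such per-day-type models could be fit (the three-way `day_type` split and the short formula are our assumptions, chosen only for illustration):

```r
# Illustrative sketch: fit a separate linear model for holidays,
# working days and weekends, keyed by a derived day_type label
day_type = ifelse(data$holiday == 1, "holiday",
                  ifelse(data$workingday == 1, "workday", "weekend"))
models = lapply(split(data, day_type), function(d)
  lm(cnt ~ temp + hum + windspeed, data = d))
# predict() with whichever model matches the day type of a new observation
```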
Our model could also be extended to apply Seasonality analysis to see how season and days of week affect the response. As we had concerns about the inclusion of the year as a predictor given the limited amount of training data, if data for additional years is available, future work could also analyze the overall demand trend year after year after controlling for seasonal components in the data.
Finally, as was previously mentioned, we focused on accurate prediction at the expense of interpretability. Future work could attempt to find models which are more easily interpreted.
data = read.csv("day.csv")
summary(data$cnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22 3152 4548 4504 5956 8714
hist(data$cnt, col = "lightslateblue", main = "Histogram of Total Ridership Count", xlab = "Total Ridership Count")
The distribution looks roughly normal, although since cnt is a count, it should in principle follow a Poisson distribution rather than a normal one.
pairs(data[,10:16], main = "Pairs Plot for Numeric Features")
It looks like there is a relationship between temp and cnt, although not necessarily a linear one. The other relationships are less clear.
cor(data[,10:16])
## temp atemp hum windspeed casual
## temp 1.0000000 0.9917016 0.12696294 -0.1579441 0.54328466
## atemp 0.9917016 1.0000000 0.13998806 -0.1836430 0.54386369
## hum 0.1269629 0.1399881 1.00000000 -0.2484891 -0.07700788
## windspeed -0.1579441 -0.1836430 -0.24848910 1.0000000 -0.16761335
## casual 0.5432847 0.5438637 -0.07700788 -0.1676133 1.00000000
## registered 0.5400120 0.5441918 -0.09108860 -0.2174490 0.39528245
## cnt 0.6274940 0.6310657 -0.10065856 -0.2345450 0.67280443
## registered cnt
## temp 0.5400120 0.6274940
## atemp 0.5441918 0.6310657
## hum -0.0910886 -0.1006586
## windspeed -0.2174490 -0.2345450
## casual 0.3952825 0.6728044
## registered 1.0000000 0.9455169
## cnt 0.9455169 1.0000000
summary(data[,10:13])
## temp atemp hum windspeed
## Min. :0.05913 Min. :0.07907 Min. :0.0000 Min. :0.02239
## 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200 1st Qu.:0.13495
## Median :0.49833 Median :0.48673 Median :0.6267 Median :0.18097
## Mean :0.49538 Mean :0.47435 Mean :0.6279 Mean :0.19049
## 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302 3rd Qu.:0.23321
## Max. :0.86167 Max. :0.84090 Max. :0.9725 Max. :0.50746
par(mfrow=c(2,2))
hist(data[,10], main="Temp", xlab = "Temperature (Celsius)", col = "tomato")
hist(data[,11], main="ATemp", xlab = "Normalized Temperature", col = "tomato2")
hist(data[,12], main="Humidity", xlab = "Humidity", col = "turquoise1")
hist(data[,13], main="Windspeed", xlab = "Windspeed", col = "skyblue")
These do not appear to be normally distributed, although Windspeed could likely be brought closer to normal with a transformation, and Humidity is already fairly close to normal.
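As a quick check of that transformation idea (the specific square-root and log choices are our guesses, not transforms used later in the report), Windspeed can be pulled toward symmetry:

```r
# Windspeed is right-skewed; sqrt and log transforms reduce the skew.
# Windspeed is strictly positive in this data, so log is well-defined.
par(mfrow = c(1, 2))
hist(sqrt(data$windspeed), main = "sqrt(Windspeed)",
     xlab = "sqrt(Windspeed)", col = "skyblue")
hist(log(data$windspeed), main = "log(Windspeed)",
     xlab = "log(Windspeed)", col = "skyblue")
```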
par(mfrow=c(1,3))
barplot(prop.table(table(data$season)), col = 1:4, main = "Distribution of Season", xlab = "Season", ylab = "Count")
barplot(prop.table(table(data$mnth)), col = 1:12, main = "Distribution of Months", xlab = "Month", ylab = "Count")
barplot(prop.table(table(data$weekday)), col = 1:7, main = "Distribution of Weekdays", xlab = "Weekday", ylab = "Count")
barplot(prop.table(table(data$holiday)), col = 6:12, main = "Distribution of Holidays", xlab = "Holiday", ylab = "Count")
barplot(prop.table(table(data$workingday)), col = 8:12, main = "Distribution of Working Days", xlab = "Working Day", ylab = "Count")
barplot(prop.table(table(data$weathersit)), col = 10:12, main = "Distribution of Weather Types", xlab = "Weather Type", ylab = "Count")
It appears that Months, Seasons and Weekdays are distributed uniformly. However, there are very few holidays, more working days than non-working days and the weather seems to be mostly clear.
# create a color palette
palette(brewer.pal(n = 12, name = "Set3"))
par(mfrow=c(1,3))
barplot(aggregate(x = data$cnt, by = list(data$mnth), FUN = sum)$x, names = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), main = "Total Ridership By Month", col = 1:12)
palette(brewer.pal(n = 4, name = "Accent"))
barplot(aggregate(x = data$cnt, by = list(data$season), FUN = sum)$x, names = c("spring", "summer", "fall", "winter"), main = "Total Ridership By Season", col = 1:4)
palette(brewer.pal(n = 4, name = "Set1"))
barplot(aggregate(x = data$cnt, by = list(data$weathersit), FUN = sum)$x, names = c("Clearish", "Misty", "LightPrecip"), main = "Total Ridership By Weather Type", col = 1:4)
There appear to be relationships between Month, Season, Weather, and Total Ridership. However, the Total Ridership by Weather Type closely resembles the distribution of the Weather Types themselves, so this relationship may not be as important as it looks.
par(mfrow=c(1,3))
palette(brewer.pal(n = 4, name = "Set3"))
barplot(aggregate(x = data$cnt, by = list(data$holiday), FUN = sum)$x, names = c("No", "Yes"), main = "Total Ridership By Holiday", col = 1:12)
palette(brewer.pal(n = 4, name = "Accent"))
barplot(aggregate(x = data$cnt, by = list(data$workingday), FUN = sum)$x, names = c("No", "Yes"), main = "Total Ridership By Workingday", col = 1:4)
palette(brewer.pal(n = 7, name = "Set2"))
barplot(aggregate(x = data$cnt, by = list(data$weekday), FUN = sum)$x, names = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), main = "Total Ridership By Weekday", col = 1:7)
While ridership is higher on non-holidays and working days, these totals very closely resemble the distributions of Holidays and Working Days in the data, so these features may not be very useful as predictors.
par(mfrow=c(1,3))
palette(brewer.pal(n = 6, name = "Dark2"))
plot(data$temp, data$cnt, main = "Temp vs Total Ridership", xlab = "Temp", ylab = "Total Ridership", col = 1)
plot(data$hum, data$cnt, main = "Humidity vs Total Ridership", ylab = "Total Ridership", xlab = "Humidity", col = 2)
plot(data$windspeed, data$cnt, main = "Windspeed vs Total Ridership", ylab = "Total Ridership", xlab = "Windspeed", col = 3)
There appears to be a roughly linear relationship between temperature and total ridership, although a polynomial fit may be more appropriate. The relationships of Humidity and Windspeed with Ridership are not clear.
par(mfrow=c(1,3))
plot(data$windspeed, data$hum, main = "Windspeed vs Humidity", ylab = "Humidity", xlab = "Windspeed", col = 4)
plot(data$windspeed, data$temp, main = "Windspeed vs Temp", ylab = "Temp", xlab = "Windspeed", col = 5)
plot(data$hum, data$temp, main = "Humidity vs Temp", ylab = "Temp", xlab = "Humidity", col = 6)
It seems as if these numerical features may be related somehow, although they do not seem to be related linearly.
First we will look at box plots of Total Ridership by the categorical features to see if anything stands out as worth exploring:
palette(brewer.pal(n = 12, name = "Set3"))
par(mfrow = c(1,2))
boxplot(cnt ~ weathersit, data = data, col = 1:4, main = "Count by Weather", ylab = "Total Ridership")
boxplot(cnt ~ season, data = data, col = 5:8, main = "Count by Season", ylab = "Total Ridership")
boxplot(cnt ~ mnth, data = data, col = 1:12, main = "Count by Month", ylab = "Total Ridership")
boxplot(cnt ~ weekday, data = data, col = 1:7, main = "Count by Weekday", ylab = "Total Ridership")
boxplot(cnt ~ workingday, data = data, col = 8:10, main = "Count by Working Day", ylab = "Total Ridership")
boxplot(cnt ~ holiday, data = data, col = 10:12, main = "Count by Holiday", ylab = "Total Ridership")
It appears that Weather, Season and Month may be worth further exploration, Weekday and Working Day don’t seem to be of much interest, and while the mean Ridership on Holidays is lower than on non-Holidays, it is not significantly lower.
We will now look at some interactions between the categorical features we identified above and some numerical features.
par(mfrow=c(1,2))
palette(brewer.pal(n = 4, name = "Set1"))
plot(data$temp, data$cnt, col = data$season, pch = 20, main = "Ridership vs Temp by Season", xlab = "Temperature", ylab = "Ridership")
legend("topleft", legend = c("Spring", "Summer", "Fall", "Winter"), col = 1:4, pch = 20)
plot(data$hum, data$cnt, col = data$season, pch = 20, main = "Ridership vs Humidity by Season", xlab = "Humidity", ylab = "Ridership")
legend("topleft", legend = c("Spring", "Summer", "Fall", "Winter"), col = 1:4, pch = 20)
As we would expect temperature to be correlated with season, the Ridership vs Temperature plot does not show any interesting patterns. However, the Ridership vs Humidity plot has vertical clusters which could indicate interesting correlations.
par(mfrow=c(1,2))
palette(brewer.pal(n = 4, name = "Set2"))
plot(data$temp, data$cnt, col = data$weathersit, pch = 20, main = "Ridership vs Temp by Weather", xlab = "Temperature", ylab = "Ridership")
legend("topleft", legend = c("Clear", "Misty", "Light Precip"), col = 1:3, pch = 20)
plot(data$hum, data$cnt, col = data$weathersit, pch = 20, main = "Ridership vs Humidity by Weather", xlab = "Humidity", ylab = "Ridership")
legend("topleft", legend = c("Clear", "Misty", "Light Precip"), col = 1:3, pch = 20)
Similarly, as we would expect Humidity to be correlated with Weather, the Ridership vs Humidity plot does not show any interesting patterns, while the Ridership vs Temperature plot indicates that there may be some interaction between temperature and weather worth exploring.
par(mfrow = c(1,2))
palette(brewer.pal(n = 12, name = "Set3"))
plot(data$temp, data$cnt, col = data$mnth, pch = 20, main = "Ridership vs Temp by Month", xlab = "Temp", ylab = "Ridership")
plot(data$hum, data$cnt, col = data$mnth, pch = 20, main = "Ridership vs Humidity by Month", xlab = "Humidity", ylab = "Ridership")
legend("topleft", legend = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), col = 1:12, lwd = 1, lty = c(0,0), pch = 20)
It is difficult to draw firm conclusions from the plots above; however, both interactions may merit further investigation.